Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify, and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.


Supplementary Text 1: Testing how PCA represents mixed and unmixed populations via a simulation
To test whether admixture can be inferred from the positioning of populations in a PCA plot, we simulated six ancestral populations with 20 cohorts per population over 50 markers. Allele frequencies (AF) per ancestral population per marker were generated at random. For each cohort, the allele frequency for marker i was calculated as AFi + 0.01*RN (Equation 1), where RN is a random number drawn from a standard normal distribution. Colors were randomly assigned to the parental populations, which were marked with circles. Cross-population inbreeding was set to 20% per generation. Mixed cohorts were crosses between two mixed cohorts or between mixed and unmixed cohorts. Unmixed cohorts were sampled within the cohort. All F1 cohorts were generated by averaging the allele frequencies of their parental cohorts. F1 cohorts also received the average colors of their parents and were marked with x. To calculate F2 cohorts, we repeated the analysis with the F1 cohorts replacing the parental cohorts. In this manner, we calculated multiple generations with a constant sample size. We calculated PCA for each generation, plotting the parental and the next-generation cohorts over the first two PCs. A convex hull was fitted to the first two PCs to identify the corners.
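The simulation described above can be sketched roughly as follows. This is a minimal illustration and not the code used for Figure S1.1: the function names, the uniform draw of ancestral allele frequencies, the ancestry bookkeeping used to tell mixed from unmixed cohorts, and the rule for choosing parents are all assumptions made for the example.

```python
import numpy as np

def simulate_parents(n_pops=6, n_cohorts=20, n_markers=50, noise=0.01, seed=0):
    """Random ancestral allele frequencies; each cohort is AF_i + 0.01 * RN."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.0, 1.0, size=(n_pops, n_markers))  # assumed uniform draw
    af = np.repeat(base, n_cohorts, axis=0)
    af = af + noise * rng.standard_normal(af.shape)
    # One-hot ancestry per cohort (hypothetical bookkeeping, not in the source).
    ancestry = np.repeat(np.eye(n_pops), n_cohorts, axis=0)
    return af, ancestry

def next_generation(af, ancestry, mix_rate=0.2, seed=0):
    """Children average two parents; about 20% of crosses ignore group boundaries."""
    rng = np.random.default_rng(seed)
    n = af.shape[0]
    child_af = np.empty_like(af)
    child_anc = np.empty_like(ancestry)
    for k in range(n):
        a = rng.integers(n)
        if rng.random() < mix_rate:
            b = rng.integers(n)  # second parent drawn from any cohort
        else:
            # second parent drawn from cohorts with the same ancestry as the first
            same = np.flatnonzero(np.all(np.isclose(ancestry, ancestry[a]), axis=1))
            b = rng.choice(same)
        child_af[k] = (af[a] + af[b]) / 2.0
        child_anc[k] = (ancestry[a] + ancestry[b]) / 2.0
    return child_af, child_anc

def pca_scores(x, k=2):
    """Scores on the first k principal components of column-centered data."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :k] * s[:k]

def hull_indices(points):
    """Row indices of the convex hull corners (Andrew's monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: (points[i][0], points[i][1]))

    def cross(o, a, b):
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
                - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))

    def half(seq):
        chain = []
        for i in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        return chain[:-1]

    return half(order) + half(order[::-1])

af, anc = simulate_parents()
for gen in range(1, 9):
    child_af, child_anc = next_generation(af, anc, seed=gen)
    # PCA on the parental and next-generation cohorts together, as in the text.
    scores = pca_scores(np.vstack([af, child_af]))
    all_anc = np.vstack([anc, child_anc])
    corners = hull_indices(scores)
    n_mixed = sum(all_anc[i].max() < 1.0 for i in corners)
    print(f"generation {gen}: {len(corners)} corners, {n_mixed} admixed")
    af, anc = child_af, child_anc
```

Counting how many hull corners are held by cohorts with mixed ancestry in each generation gives a numerical counterpart to inspecting the plots by eye.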
In the initial generations, every corner was occupied by an ancestral population, but not every ancestral population was at a corner. In Figure S1.1A, the red and cyan ancestral populations are positioned at the hull's center, indistinguishable from the mixed cohorts. This pattern continued until the sixth generation (Figure S1.1F). By then, one of the ancestral populations had gone extinct, and an admixed population took the corner. This process was not surprising because admixed populations were already positioned along the edges (Figure S1.1A-H). It was only a matter of time until the disappearance of one of the original populations would change the hull's shape. Remarkably, this process occurred while some of the unmixed parental populations remained in the hull's center. By the eighth generation, half of the corner populations were admixed cohorts (Figure S1.1H).
These results were replicated over multiple simulations, irrespective of the number of parental populations, sample sizes, marker sizes, and mixture proportions. Modulating some of these factors expedited or slowed the process, but the ultimate results were the same.
In summary, even in simple simulations that do not capture the complexity of genomic data and life processes, it is impossible to deduce whether populations are ancestral (unmixed) or mixed based on their position in a PCA plot. Furthermore, while simulations may initially show only ancestral populations in the corners of an imaginary hull fitted to the first two PCs, this effect is temporary. After several generations, admixed populations will reach the corners. These findings agree with our observations where admixed populations (e.g., Ashkenazic Jews) appeared both at the corner (e.g., Figure 11C) and the center (e.g., Figure 11A) of PCA plots. The question of how analyzing admixed groups with two or three ancestral populations affects findings for unmixed groups is illustrated through a typical study case in Box S2.1.

Box S2.1 - Studying the origin of Black using the primary and secondary colors
Though the previous analyses could not resolve the origin of Black (Box 2), there was a consensus that admixed cohorts can provide novel insights even though the issue of sampling bias remained unaddressed. An even-sample analysis (n=10) that included Cyan and Purple supported earlier suspicions that Black is a Green-Red admix (Figure S2.1A). Concerned with the low sample sizes, the Black-is-Green group increased the Red, Blue, and Purple sample sizes (nRed, nBlue, nPurple=100), which showed both that Black is closer to Green and that Cyan is closer to Blue (Figure S2.1B). The Black-is-Blue group argued that admixed individuals should be sampled at lower numbers (nCyan, nPurple=5) and that, since Blue and Green shared common origins (Figure 5D), they are interchangeable and should together be sampled in equal numbers to Red and Black (nGreen + nBlue, nRed, nBlack=50). The results showed perfect Black-Blue, Green-Cyan, and Red-Purple overlap (Figure S2.1C), at odds with the historical records of the origin of secondary colors. To test that, an independent group ventured into the field and collected Yellow. The results positioned Cyan and Purple close to their postulated ancestral cohorts and yielded a new genuine insight that Black is Yellow (Figure S2.1D). The Black-is-Red group did not dispute this claim but argued that Yellow is a recent Black admixture and that Yellow should be excluded from future analyses to gain a true understanding of Black's ancient origins. The removal of Yellow from the analysis showed a complete Black-Red overlap and concrete evidence that Black has a Red origin (Figure S2.1E). Yet those results could not be accepted, as follow-up analyses lent credence to a novel finding proving that Black originated from Cyan (Figure S2.1F). To consolidate these different observations, the groups gathered together and a new consensus emerged.
Since Black clustered mostly with Green, which is a shade of Blue (Figure 5D), it was concluded that Black is the ancestor of Green and Blue, which partially explained many of the conflicting results reported thus far. That Green and Cyan clustered together (Figure S2.1C) supported the idea that Black is part of the Green-Blue-Cyan clade (Figure S2.1F). Since Cyan was positioned in the corner of the PC plot in all analyses (Figure S2.1), it also became established that it is a novel primary color. Not everyone agreed with these conclusions.

The origin of Papuans and Bougainvilleans or Northern Melanesians (NMs) has been extensively investigated in the literature [1]. PCA yielded conflicting results not only about their distance to other samples but also about their distance to each other. When analyzed against East Asians and Oceanians, Papuans and Bougainvilleans clustered at the edge of the plot either together but separately from other samples [2] or separately from each other and other samples [3]. Perhaps consequently, Papuan ancestry is considered distinct from Asian ancestry, which, in turn, is considered distinct from non-Asian ancestries [1]. However, this is not obvious from PCA studies, as Papuans were also shown to cluster with Amerindians and close to Central-Southern Asia [4].
It is easy to show how PCA can be used to generate conflicting results when NMs are analyzed alongside other highly admixed groups. We first show that when using even-sampling, the inclusion of NMs creates a European-Asian cluster that was not observed before (Figure S2.2A) with NMs clustering separately from one another at the extreme edge with East Asians clustering along a "European-Oceanian cline," appearing as a Europeans-NMs admixed group. Using uneven sampling, as done in all the studies, yields various contradictory results with NMs clustering together and close to East Asians (Figure S2.2B), appearing as a three-way tri-continental admix group distinct from each other and close to Pakistani (Hazara) (Figure S2.2C), or remaining highly distinct and separate from other populations but influencing the formation of an African-European cluster with Pakistani (Hazara) as an outgroup (Figure S2.2D). Each of these results supports a different explanation for the origin of NMs, all equally mathematically valid, which is biologically implausible. These examples show that PCA can produce results that may appear, in part, meaningful (i.e., Africans are separated from non-Africans in Figure S2.2A-C) based on our a priori knowledge. Still, even in this case, they provide biologically meaningless and contradictory outcomes.

Results vary based on the sample size of each population. A) nall=20, B) nAf=nEu=nEA=50; nSa=nnM=20, C) nAf=nEA=300; nEu=50; nSa=500; nnM=24, D) nAf=10; nEu=60; nEA=100; nSa=200; nnM=24.
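The dependence on sample sizes can also be reproduced in a toy version of the color model. The sketch below is illustrative only: the RGB coordinates, the noise level, the sample sizes, and the helper names are assumptions chosen for the example, not the settings behind Figures S2.1 or S2.2. Black is equally distant from Red, Green, and Blue in RGB space, yet its apparent distance to each of them on the first two PCs shifts once one population is oversampled.

```python
import numpy as np

# Hypothetical RGB coordinates for four color populations.
COLORS = {"Red": (1, 0, 0), "Green": (0, 1, 0), "Blue": (0, 0, 1), "Black": (0, 0, 0)}

def sample(sizes, noise=0.02, seed=0):
    """Draw n noisy individuals around each population's color."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for name, n in sizes.items():
        center = np.asarray(COLORS[name], dtype=float)
        rows.append(center + noise * rng.standard_normal((n, 3)))
        labels += [name] * n
    return np.vstack(rows), np.array(labels)

def pc_scores(x, k=2):
    """Scores on the first k principal components of column-centered data."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :k] * s[:k]

def distances_from(target, sizes, seed=0):
    """Distance on PC1-PC2 between the target centroid and every other centroid."""
    x, labels = sample(sizes, seed=seed)
    scores = pc_scores(x)
    centroid = {name: scores[labels == name].mean(axis=0) for name in sizes}
    return {name: float(np.linalg.norm(centroid[name] - centroid[target]))
            for name in sizes if name != target}

even = distances_from("Black", {"Red": 10, "Green": 10, "Blue": 10, "Black": 10})
uneven = distances_from("Black", {"Red": 100, "Green": 10, "Blue": 10, "Black": 10})
print("even sampling:  ", {k: round(v, 2) for k, v in even.items()})
print("uneven sampling:", {k: round(v, 2) for k, v in uneven.items()})
```

Under even sampling the three distances are nearly identical, whereas oversampling Red places Black closer to Green and Blue than to Red, although nothing about the colors themselves has changed.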

The case of pairwise comparisons
Several authors adopted a pairwise comparison scheme to assess the genetic similarity between two cohorts of interest (e.g., [5-8]). This setting is prevalent in case-control analyses that seek overlap between the compared groups (e.g., cases and controls [9-12]). We assessed whether this setting could lead to erroneous conclusions by analyzing two non-overlapping color populations (Figure S2.3A). We found that in the presence of two other samples, they highly overlap (Figure S2.3B) and may appear as part of the same cohort, thus altering the conclusions. Moreover, the latter analysis explains 99% of the variation compared to the former (94%), which may appear more reliable. We next analyzed two cohorts of interest alongside other populations. We found that whether the two cohorts overlap or not depends on the choice of the other populations (Figure S2.3C-D).
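A small numerical sketch of this pairwise effect is given below. It is an illustration under assumed settings (two hypothetical Blue shades, two added color populations, the noise level, and the `relative_gap` measure are all choices made for the example) and does not reproduce Figure S2.3. The gap between the two cohorts is expressed relative to the widest axis of the plot, which is roughly how overlap is judged by eye.

```python
import numpy as np

def draw(center, n, rng, noise=0.02):
    """Draw n noisy individuals around an RGB color."""
    return np.asarray(center, dtype=float) + noise * rng.standard_normal((n, 3))

def pca(x, k=2):
    """Scores on the first k PCs and the fraction of variance they explain."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return u[:, :k] * s[:k], explained

def relative_gap(scores, n):
    """Centroid distance between the first two cohorts over the widest plot axis."""
    gap = np.linalg.norm(scores[:n].mean(axis=0) - scores[n:2 * n].mean(axis=0))
    return float(gap / np.ptp(scores, axis=0).max())

rng = np.random.default_rng(0)
n = 20
shade_a = draw((0.0, 0.0, 1.0), n, rng)     # hypothetical Blue shade
shade_b = draw((0.0, 0.15, 0.85), n, rng)   # hypothetical second Blue shade
others = np.vstack([draw((1, 0, 0), n, rng), draw((0, 1, 0), n, rng)])

alone, ev_alone = pca(np.vstack([shade_a, shade_b]))
joint, ev_joint = pca(np.vstack([shade_a, shade_b, others]))

gap_alone = relative_gap(alone, n)
gap_joint = relative_gap(joint, n)
print(f"pairwise only: relative gap {gap_alone:.2f}, variance explained {ev_alone:.3f}")
print(f"with others:   relative gap {gap_joint:.2f}, variance explained {ev_joint:.3f}")
```

In this toy setting the two shades fill most of the plot when analyzed alone but collapse into a small region once two distant populations are added, while the proportion of variance explained by the first two PCs goes up.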
Further analysis of the second and third PCs (Figure S6) for the same populations in Figure S2.3 showed that the two Blue shades overlap in both cases. By contrast, the two Green shades did not, though both were closer to Red and Yellow than to Green, their closest color population. Finally, an analysis of the first and third PCs (Figure S7) showed that the Blue and Green shades are separate in both cases. In the latter case, the Green shades were closer to both the Green and Yellow populations. Overall, these examples show that PCA outcomes as to whether populations overlap are independent of whether or not the populations are distinct. Instead, they are an artifact of the sampling scheme, for which no rules exist. These examples demonstrate how, when applied in a pairwise setting, PCA can lead to erroneous conclusions concerning clustering, identity, and distance cross-dimensionally. We next tested the performance of PCA in pairwise settings in human populations. In Figure S2.4A, we show that two Chinese populations, which PCA purports are a single homogeneous population, can be split if Japanese are included in the sampling scheme (Figure S2

Supplementary Text 3: Evaluating the accuracy of correcting for population stratification with PCA
To test whether correcting for population stratification using PCA in a case-control association study improves the accuracy of the results, i.e., identifies the causal markers, we analyzed a dataset of 300 Europeans and 20 Puerto Ricans genotyped over ~129,000 markers as described in the main text. The Europeans were randomly divided into two groups and assigned random disease status. The Puerto Ricans were all assigned to the control group. Overall, we created two equally sized groups of 160 samples each. Ten markers were modified to be causal markers by randomly selecting genotypes that differ in frequencies between the cases and controls using the weights described in Table S3.1. We carried out six analyses, testing five dominant and one heterozygote model. After generating the causal markers, we performed the logistic regression analysis using PLINK 2.0 (www.cog-genomics.org/plink/2.0/) [13] with correction for multiple testing using the Benjamini & Yekutieli (2001) [14] step-up false discovery control before and after adjusting for ten and two principal components. Our findings are shown in Table S3.2.
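For readers unfamiliar with the multiple-testing step, the Benjamini & Yekutieli step-up adjustment can be sketched as follows. This is a stand-alone illustration of the procedure and not the PLINK implementation; the function name and the example p-values are hypothetical. Each sorted p-value p(k) is scaled by m*c(m)/k, where m is the number of tests and c(m) = 1 + 1/2 + ... + 1/m, and monotonicity is then enforced from the largest p-value downwards.

```python
import numpy as np

def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)                 # harmonic correction factor c(m)
    scaled = p[order] * m * c_m / ranks
    # Step-up: each adjusted value is the minimum over itself and all larger ranks.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical p-values for four markers.
example = [0.03, 0.001, 0.5, 0.01]
print(benjamini_yekutieli(example))
```

A marker would then be reported as significant when its adjusted p-value falls below the chosen threshold (0.05 in the analyses described here).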
Comparing the results before and after PCA adjustment shows the advantage of avoiding PCA adjustment in all the analyses. PCA-adjusted results consistently yielded more false negatives and lower significance values for both ten and two PCs and for both dominant and heterozygote models. This is particularly notable in the fifth analysis, where the two significant causal markers were lost after PCA adjustment. The fourth analysis allows comparing the p-values reported before and after PCA adjustment (Figure S3.1). The p-values before PCA adjustment were lower than those after adjusting for two PCs, with the highest p-values calculated after adjusting for ten PCs. Similar results were obtained for random assignment of case-control status to all the samples (results not shown).

Results of the six case-control association analyses. In each analysis, the data were calculated before and after adjusting for PCA using ten or two components. Only results with p<0.05 after correcting for multiple tests are shown. The last column indicates whether the causal markers found are the ones originally simulated.